AI & Reinforcement Learning

Artificial Intelligence

Deep learning fundamentals and reinforcement learning, with an interactive Q-learning maze that demonstrates the agent–environment loop end to end.

Deep Learning

Multi-layer neural networks trained end-to-end with backpropagation.

Architectures, in one line each

MLP — stacked dense layers h_{k+1} = σ(W_k h_k + b_k). Universal approximator; no built-in spatial or temporal structure.
CNN — weight-shared convolutions exploit translation equivariance; the workhorse for images.
Transformer — self-attention softmax(QK^T/√d) V replaces recurrence; the basis of modern LLMs and ViTs.
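
To make the MLP and attention formulas above concrete, here is a minimal pure-Python sketch. It is illustrative only: the function names (softmax, attention, dense) are hypothetical, σ is assumed to be tanh, and vectors are plain lists rather than framework tensors.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.

    Q, K, V are lists of row vectors; d is the key dimension.
    """
    d = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Output row is the attention-weighted average of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

def dense(W, b, h, act=math.tanh):
    """One MLP layer h_{k+1} = act(W h_k + b); W is a list of rows (tanh is an assumption)."""
    return [act(sum(w * x for w, x in zip(row, h)) + bi) for row, bi in zip(W, b)]

# Two identical keys receive equal weight, so the output is the mean of the values.
mixed = attention([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[0.0, 2.0], [4.0, 6.0]])
hidden = dense([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [0.0, 0.0])
```

A real implementation would batch these operations as matrix multiplies; the loops here only mirror the formulas term by term.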

Backpropagation

Gradients propagate via the chain rule through the computation graph:

∂L/∂θ = (∂L/∂z) · (∂z/∂θ)

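Modern frameworks apply reverse-mode autodiff to this graph: PyTorch records it dynamically as operations run, while JAX traces Python functions and transforms them. With gradients handled automatically, training stability depends largely on initialisation, normalisation (BatchNorm, LayerNorm), and residual connections.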
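
As a sanity check on the chain rule, the sketch below backpropagates by hand through a single sigmoid neuron with squared-error loss and compares the result with a finite-difference estimate. The neuron, the loss, and all function names are assumptions chosen for illustration; they are not taken from the linked repositories.

```python
import math

def forward(w, b, x, y):
    """Single sigmoid neuron with squared-error loss; returns (z, a, loss)."""
    z = w * x + b
    a = 1.0 / (1.0 + math.exp(-z))
    loss = 0.5 * (a - y) ** 2
    return z, a, loss

def backward(w, b, x, y):
    """Reverse-mode pass: multiply local derivatives from the loss back to the parameters."""
    _, a, _ = forward(w, b, x, y)
    dL_da = a - y                 # derivative of 0.5 * (a - y)^2 w.r.t. a
    da_dz = a * (1.0 - a)         # derivative of the sigmoid
    dL_dz = dL_da * da_dz
    return dL_dz * x, dL_dz       # dz/dw = x, dz/db = 1

def numeric_grad_w(w, b, x, y, eps=1e-6):
    """Central finite difference of the loss with respect to w."""
    return (forward(w + eps, b, x, y)[2] - forward(w - eps, b, x, y)[2]) / (2 * eps)

grad_w, grad_b = backward(0.5, -0.1, 2.0, 1.0)
check_w = numeric_grad_w(0.5, -0.1, 2.0, 1.0)
```

The analytic and numeric values should agree to several decimal places; a mismatch in a check like this usually points to a wrong local derivative.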

GitHub projects

Statistics — jiwook021/Algorithms/Statistics
Machine Learning — jiwook021/Algorithms/AI/MachineLearning
Deep Learning / Computer Vision — jiwook021/Algorithms/AI/MachineLearning/Deep_learning
Reinforcement Learning — jiwook021/Algorithms/AI/ReinforcementLearning
Computer Vision — jiwook021/Algorithms/ComputerVision
ROS SLAM with C++ — jiwook021/ROS_SLAM_Projects

Tutorials

GeeksforGeeks Deep Learning Tutorial — geeksforgeeks.org/deep-learning-tutorial

Reinforcement Learning

An agent picks actions in an environment to maximise long-run reward. The maze below trains a tabular Q-learner from scratch in your browser.

Agent ↔ environment loop

At each step t the agent observes state s_t, chooses action a_t, receives reward r_{t+1}, and transitions to s_{t+1}. The objective is the discounted return:

G_t = Σ_{k=0}^{∞} γ^k r_{t+k+1},    0 ≤ γ < 1
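
For a finite episode the sum can be evaluated right to left, using the recursion G_t = r_{t+1} + γ G_{t+1}. The helper below is a hypothetical sketch that assumes the rewards are given as a list starting at r_{t+1}.

```python
def discounted_return(rewards, gamma):
    """Return G_t for rewards [r_{t+1}, r_{t+2}, ...] of a finite episode."""
    g = 0.0
    # Work backwards so each step applies G = r + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

g_example = discounted_return([1.0, 1.0, 1.0], 0.5)   # 1 + 0.5 + 0.25
```

With a constant reward of 1 on every step, the infinite sum is the geometric series 1 / (1 − γ), which is 10 for γ = 0.9; this is why γ < 1 keeps the return bounded.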

Q-learning update

Off-policy temporal-difference control. The action-value Q(s, a) is updated toward a one-step bootstrap target:

Q(s, a) ← Q(s, a) + α [ r + γ · max_a′ Q(s′, a′) − Q(s, a) ]

Tabular Q-learning converges to the optimal Q* with probability 1 when every (s, a) is visited infinitely often and the learning rate α decays appropriately, i.e. Σ α = ∞ and Σ α² < ∞ (Watkins, 1989; Watkins & Dayan, 1992).
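
The update rule can be tried on a toy problem. The sketch below is not the maze solver from this page: it assumes a hypothetical five-state corridor with the goal at the right end, a reward of 1 on reaching the goal, and a uniformly random behaviour policy, which illustrates the off-policy property. A constant α is used for simplicity, which is adequate here only because the environment is deterministic.

```python
import random

N_STATES, GOAL = 5, 4        # corridor states 0..4, goal at the right end
ACTIONS = (0, 1)             # 0 = left, 1 = right

def step(s, a):
    """Deterministic corridor dynamics: reward 1 on reaching the goal, else 0."""
    s2 = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
    done = s2 == GOAL
    return s2, (1.0 if done else 0.0), done

def q_update(Q, s, a, r, s2, done, alpha, gamma):
    """Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    # No bootstrap from a terminal state: its future value is zero.
    target = r if done else r + gamma * max(Q[s2])
    Q[s][a] += alpha * (target - Q[s][a])

def train(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = rng.choice(ACTIONS)          # random behaviour policy (off-policy)
            s2, r, done = step(s, a)
            q_update(Q, s, a, r, s2, done, alpha, gamma)
            s = s2
    return Q

Q = train()
# Greedy policy read off the learned table for the non-terminal states.
greedy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(GOAL)]
```

Even though the agent never acts greedily during training, the learned table prefers "right" in every state, because the target uses max over next actions rather than the action actually taken.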

ε-greedy exploration

Pick argmax_a Q(s, a) with probability 1 − ε, a uniform random action otherwise. Annealing ε across training shifts the agent from exploration to exploitation.
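
A minimal sketch of the selection rule and one possible annealing schedule follows. The linear schedule and its default endpoints (1.0 down to 0.05) are assumptions for illustration; the page's maze solver may anneal ε differently or keep it fixed.

```python
import random

def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one.

    q_row holds Q(s, a) for each action a in the current state.
    Ties in the greedy case resolve to the lowest action index.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])

def linear_epsilon(episode, total, eps_start=1.0, eps_end=0.05):
    """Linearly anneal epsilon from eps_start to eps_end over `total` episodes."""
    frac = min(1.0, episode / max(1, total - 1))
    return eps_start + frac * (eps_end - eps_start)

# Early episodes explore almost uniformly; late episodes are mostly greedy.
schedule = [linear_epsilon(e, 100) for e in (0, 50, 99)]
action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
```

Keeping eps_end above zero retains a little exploration late in training, which matters if the environment can still surprise the agent.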

Q-Learning Maze Solver

Controls: Maze Size, Learning Rate (α), Discount Factor (γ), Epsilon (ε), Episodes, Delay (ms).

Maze Environment legend: Start, Goal, Wall, Agent, Visited, Optimal.

Training Stats: Episode, Steps, Rewards, Success (%).